The dataset contains records of certain chemical properties of red wines and the quality rating assigned to each wine. We are going to try to discover any relationships between these variables and the quality of the red wine.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Looking at the summary of the dataset, we can see that variables 2-12 are the chemical properties of the wines and variable 13 is the quality rating. The first variable, X, is an ID for the wine record, and we can safely leave it out of the analysis.
We start by plotting univariate plots for each variable to look at its distribution and some summary statistics.
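As a minimal sketch of leaving the ID column out, the helper below drops `X` from a data frame. The function name `drop_id_column`, the data frame name `wine` and the file name in the commented usage are assumptions for illustration; the text does not state how the data were loaded.

```r
# Minimal sketch: drop the row-identifier column before analysis.
# Assumption: the data frame uses the column names shown in the summary above.
drop_id_column <- function(df, id_col = "X") {
  # Keep every column except the identifier; drop = FALSE keeps a data frame
  df[, setdiff(names(df), id_col), drop = FALSE]
}

# Hypothetical usage (file name is an assumption, not stated in the text):
# wine <- read.csv("wineQualityReds.csv")
# wine <- drop_id_column(wine)
# str(wine)
```

Dropping the column once up front avoids having to exclude `X` in every later calculation.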
## [1] "Summary of Quality"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## [1] "Summary of fixed.acidity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] "Summary of volatile.acidity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## [1] "Summary of citric.acid"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "Summary of residual.sugar"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## [1] "Summary of chlorides"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] "Summary of free.sulfur.dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## [1] "Summary of total.sulfur.dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## [1] "Summary of density"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## [1] "Summary of pH"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## [1] "Summary of sulphates"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## [1] "Summary of alcohol"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
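Summaries like the ones above could be produced with a small helper along these lines. This is only a sketch: the function names are hypothetical, base R `hist` stands in for whatever plotting code was actually used, and `wine` is an assumed data frame name.

```r
# Print a labelled five-number summary (plus mean) for one variable.
summarise_variable <- function(df, var) {
  print(paste("Summary of", var))
  s <- summary(df[[var]])
  print(s)
  invisible(s)
}

# Draw a histogram of one variable; extra arguments (e.g. breaks) go to hist().
plot_distribution <- function(df, var, ...) {
  hist(df[[var]], main = paste("Distribution of", var), xlab = var, ...)
}

# Hypothetical usage over every chemical property:
# for (v in setdiff(names(wine), c("X", "quality"))) {
#   summarise_variable(wine, v)
#   plot_distribution(wine, v, breaks = 30)
# }
```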
Below are some of the observations on the above plots:
density and pH seem to have a normal distribution.
residual.sugar, chlorides and sulphates seem to have a long tail on the positive side.
fixed.acidity, volatile.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and alcohol seem to be right-skewed, with a shape that roughly resembles a Poisson distribution.
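One rough way to back up these visual impressions with a number is the moment-based sample skewness: values near 0 suggest a symmetric shape, and positive values suggest a tail on the positive side. The helper name `sample_skewness` and the data frame name `wine` are assumptions; this is a sketch, not part of the original analysis.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Hypothetical usage across the chemical properties:
# sapply(wine[setdiff(names(wine), c("X", "quality"))], sample_skewness)
```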
This dataset has 1599 observations and 13 variables. These 1599 observations correspond to 1599 types of red wines.
Let’s begin by finding the correlation between each independent variable and the dependent variable.
## X fixed.acidity volatile.acidity
## 0.066 0.124 0.391
## citric.acid residual.sugar chlorides
## 0.226 0.014 0.129
## free.sulfur.dioxide total.sulfur.dioxide density
## 0.051 0.185 0.175
## pH sulphates quality
## 0.058 0.251 1.000
The results seem to suggest that none of the independent variables has a strong correlation with quality. So, we would need to work with multiple independent variables to see whether we get a stronger relationship with quality.
Not yet. I may update this section if I create more variables.
We are going to plot the quality variable as a factor, as this will help us treat it as a classification problem.
We will plot the variables against quality now. We want to see how various variables change with the quality. We will plot quality on the x axis and each variable by turn on the y axis. We are using boxplots here. We are using a function here to plot all the variables, hence observations will be noted at the end of this section.
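The approach described above (quality as a factor on the x axis, one boxplot per variable) can be sketched as follows. The function names are hypothetical and base R `boxplot` is used for illustration; the plotting library actually used is not stated in the text.

```r
# Convert quality to an ordered factor so each rating becomes one box.
quality_as_factor <- function(df) {
  df$quality <- factor(df$quality,
                       levels = sort(unique(df$quality)),
                       ordered = TRUE)
  df
}

# Boxplot of one variable (y axis) against quality (x axis).
boxplot_by_quality <- function(df, var) {
  # reformulate() builds the formula  var ~ quality  from strings
  boxplot(reformulate("quality", response = var), data = df,
          xlab = "quality", ylab = var,
          main = paste(var, "by quality"))
}

# Hypothetical usage, assuming a data frame named wine:
# wine <- quality_as_factor(wine)
# for (v in c("alcohol", "volatile.acidity", "sulphates", "citric.acid")) {
#   boxplot_by_quality(wine, v)
# }
```

Wrapping the plot in a function keeps the loop over variables short, which matches the note above that all variables are plotted by one function.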
However, before plotting, let’s calculate the correlation between the variables and the quality.
## fixed.acidity volatile.acidity citric.acid
## 0.124 -0.391 0.226
## residual.sugar chlorides free.sulfur.dioxide
## 0.014 -0.129 -0.051
## total.sulfur.dioxide density pH
## -0.185 -0.175 -0.058
## sulphates alcohol
## 0.251 0.476
None of the variables has a correlation with quality above 0.500 in absolute value. The highest is alcohol (0.476). We are not going to plot every variable against quality. We will pick the ones with the highest correlations and plot those, namely alcohol, volatile.acidity, sulphates and citric.acid.
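A sketch of how the correlations with quality could be computed and ranked is given below. It uses Pearson correlation (the default of `cor`), which is an assumption since the method is not stated; the function name and the `wine` data frame name are also hypothetical.

```r
# Correlation of every predictor with quality, sorted by absolute value.
correlation_with_quality <- function(df, target = "quality", ignore = "X") {
  predictors <- setdiff(names(df), c(target, ignore))
  # as.character() first, so this also works if quality is stored as a factor
  y <- as.numeric(as.character(df[[target]]))
  r <- sapply(df[predictors], function(x) cor(x, y))
  round(r[order(abs(r), decreasing = TRUE)], 3)
}

# Hypothetical usage: the four strongest candidates for plotting
# head(correlation_with_quality(wine), 4)
```

Sorting by absolute value keeps strongly negative variables such as volatile.acidity near the top alongside the positive ones.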
Out of all the variables, alcohol (% by volume) has the strongest correlation with quality: 0.476. The lowest quality wines (i.e. with a rating of 3 or 4) have a mean alcohol % below 11%, and the highest quality wines (with a rating of 7 and above) have an alcohol % above 11%. However, this increase in alcohol % with increasing quality rating doesn’t hold for wines with a quality rating of 5, where the mean alcohol % is actually lower than for wines with a quality rating of 4.
Volatile acidity has a -0.391 correlation with quality. It’s visible from the plot that volatile acidity has a negative relationship with quality: volatile acidity decreases as quality increases.
Sulphates has a correlation of 0.251 with quality. The plot shows a gradual increase in sulphates as the quality increases.
Citric acid has a correlation of 0.226 with quality. The plot shows a gradual increase in citric acid as the quality increases.
volatile.acidity, density, pH, citric.acid, sulphates and alcohol values change as the quality changes.
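The per-rating comparisons above (for example, the mean alcohol % of wines rated 5 versus wines rated 4) could be checked with a table of group means. This is a minimal sketch; the function name `mean_by_quality` and the data frame name `wine` are assumptions.

```r
# Mean of each selected variable within every quality rating.
mean_by_quality <- function(df, vars) {
  aggregate(df[vars], by = list(quality = df$quality), FUN = mean)
}

# Hypothetical usage:
# mean_by_quality(wine, c("alcohol", "volatile.acidity",
#                         "sulphates", "citric.acid"))
```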
## X fixed.acidity volatile.acidity citric.acid
## X 1.000 -0.268 -0.009 -0.154
## fixed.acidity -0.268 1.000 -0.256 0.672
## volatile.acidity -0.009 -0.256 1.000 -0.552
## citric.acid -0.154 0.672 -0.552 1.000
## residual.sugar -0.031 0.115 0.002 0.144
## chlorides -0.120 0.094 0.061 0.204
## free.sulfur.dioxide 0.090 -0.154 -0.011 -0.061
## total.sulfur.dioxide -0.118 -0.113 0.076 0.036
## density -0.368 0.668 0.022 0.365
## pH 0.136 -0.683 0.235 -0.542
## sulphates -0.125 0.183 -0.261 0.313
## alcohol 0.245 -0.062 -0.202 0.110
## residual.sugar chlorides free.sulfur.dioxide
## X -0.031 -0.120 0.090
## fixed.acidity 0.115 0.094 -0.154
## volatile.acidity 0.002 0.061 -0.011
## citric.acid 0.144 0.204 -0.061
## residual.sugar 1.000 0.056 0.187
## chlorides 0.056 1.000 0.006
## free.sulfur.dioxide 0.187 0.006 1.000
## total.sulfur.dioxide 0.203 0.047 0.668
## density 0.355 0.201 -0.022
## pH -0.086 -0.265 0.070
## sulphates 0.006 0.371 0.052
## alcohol 0.042 -0.221 -0.069
## total.sulfur.dioxide density pH sulphates alcohol
## X -0.118 -0.368 0.136 -0.125 0.245
## fixed.acidity -0.113 0.668 -0.683 0.183 -0.062
## volatile.acidity 0.076 0.022 0.235 -0.261 -0.202
## citric.acid 0.036 0.365 -0.542 0.313 0.110
## residual.sugar 0.203 0.355 -0.086 0.006 0.042
## chlorides 0.047 0.201 -0.265 0.371 -0.221
## free.sulfur.dioxide 0.668 -0.022 0.070 0.052 -0.069
## total.sulfur.dioxide 1.000 0.071 -0.066 0.043 -0.206
## density 0.071 1.000 -0.342 0.149 -0.496
## pH -0.066 -0.342 1.000 -0.197 0.206
## sulphates 0.043 0.149 -0.197 1.000 0.094
## alcohol -0.206 -0.496 0.206 0.094 1.000
The strongest relationship, by absolute value, is between pH and fixed.acidity (-0.683).
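Rather than scanning the matrix by eye, the strongest pair can be picked out programmatically. The sketch below is an assumption about how one might do it, not the original code; `strongest_pair` and `wine` are hypothetical names.

```r
# Find the pair of distinct variables with the largest absolute correlation.
strongest_pair <- function(df, ignore = c("X", "quality")) {
  m <- cor(df[setdiff(names(df), ignore)])
  # Blank out the diagonal and upper triangle so each pair is seen once
  m[upper.tri(m, diag = TRUE)] <- NA
  k <- which.max(abs(m))            # NA entries are skipped
  pos <- arrayInd(k, dim(m))
  data.frame(var1 = rownames(m)[pos[1, 1]],
             var2 = colnames(m)[pos[1, 2]],
             r = round(m[k], 3),
             stringsAsFactors = FALSE)
}

# Hypothetical usage:
# strongest_pair(wine)
```

Keeping the sign in the result makes it clear whether the strongest relationship is positive or negative.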
We are picking alcohol and volatile.acidity to plot against quality, as they seem to have the strongest relationship with quality compared with the other variables. Alcohol has a mostly positive relationship with quality: wines with a higher alcohol % are rated higher. Volatile acidity has a mostly negative relationship with quality. Putting these together, wines with a higher alcohol % and lower volatile acidity seem to be rated higher quality more often, and wines with a lower alcohol % and higher volatile acidity seem to be rated lower quality more often.
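A scatter plot of two variables coloured by quality, as described above, might be sketched like this. Base R graphics and the blue-to-red colour ramp are assumptions chosen for illustration, and `scatter_by_quality` and `wine` are hypothetical names.

```r
# Scatter plot of two variables with points coloured by quality rating.
scatter_by_quality <- function(df, x_var, y_var) {
  q <- factor(df$quality)
  # One colour per rating, running from blue (low) to red (high)
  palette <- colorRampPalette(c("steelblue", "firebrick"))(nlevels(q))
  plot(df[[x_var]], df[[y_var]],
       col = palette[as.integer(q)], pch = 19,
       xlab = x_var, ylab = y_var,
       main = paste(y_var, "vs", x_var, "by quality"))
  legend("topright", legend = levels(q), col = palette,
         pch = 19, title = "quality")
  invisible(setNames(palette, levels(q)))
}

# Hypothetical usage:
# scatter_by_quality(wine, "alcohol", "volatile.acidity")
# scatter_by_quality(wine, "citric.acid", "volatile.acidity")
```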
Here, we plotted volatile acidity and citric acid colored by quality. As compared to the previous plot (alcohol and volatile acidity), the distinction between low and high quality wines is not as clear here. However, we see that the wines rated higher seem to have higher citric acid and low volatile acidity and the wines rated lower seem to have lower citric acid and higher volatile acidity.
The high quality wines tend to have a higher alcohol content and lower volatile acidity content. Similarly, higher rated wines seem to have higher citric acid content and lower volatile acidity content.
No, didn’t see any worth mentioning.
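Quality is the rating of the wines on a scale of 1-10. There are no wines in the dataset rated 1, 2, 9 or 10. A very high share of the wines (more than 4/5ths) are of medium quality, i.e. rated 5 or 6. Also, the quality distribution is slightly right-skewed: the counts thin out more slowly towards the higher ratings (199 wines rated 7 and 18 rated 8) than towards the lower ones (53 rated 4 and 10 rated 3).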
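The share of medium-quality wines can be verified directly from the rating counts printed in the "Summary of Quality" output earlier. The snippet below simply re-enters those counts by hand; the variable names are chosen for illustration.

```r
# Rating counts copied from the quality table in the univariate section.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681,
                    "6" = 638, "7" = 199, "8" = 18)

total <- sum(quality_counts)                               # 1599 wines
share_medium <- sum(quality_counts[c("5", "6")]) / total   # ratings 5 and 6
mean_quality <- sum(as.numeric(names(quality_counts)) * quality_counts) / total

round(c(share_medium = share_medium, mean_quality = mean_quality), 3)
# share_medium = 0.825, mean_quality = 5.636
```

The share of about 0.825 is consistent with the "more than 4/5ths" statement, and the mean matches the 5.636 reported in the summary.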
Out of all the variables, alcohol (% by volume) has the strongest correlation with quality: 0.476. The lowest quality wines (i.e. with a rating of 3 or 4) have a mean alcohol % below 11%, and the highest quality wines (with a rating of 7 and above) have an alcohol % above 11%. However, this increase in alcohol % with increasing quality rating doesn’t hold for wines with a quality rating of 5, where the mean alcohol % is actually lower than for wines with a quality rating of 4.
We are picking alcohol and volatile.acidity to plot against quality, as they seem to have the strongest relationship with quality compared with the other variables. Alcohol has a mostly positive relationship with quality: wines with a higher alcohol % are rated higher. Volatile acidity has a mostly negative relationship with quality. Putting these together, wines with a higher alcohol % and lower volatile acidity seem to be rated higher quality more often, and wines with a lower alcohol % and higher volatile acidity seem to be rated lower quality more often.
The dataset comprises data for 1599 red wines, rated on a quality scale of 1-10. Every record has, apart from the quality rating and the ID, 11 variables describing various chemical attributes of that wine. Our goal in this analysis was to find relationships between these chemical attributes and the quality rating of the wines.
I began by examining the variables independently, looking at their ranges and distributions. I also converted quality into a factor variable to help with the classification. After that, I calculated the correlations between each of the independent variables and quality. We didn’t find very strong relationships, although two variables did stand out: alcohol and volatile acidity.
Then we plotted the variables against quality and discovered that alcohol, volatile acidity, sulphates and citric acid seem to show some relationship with quality, although not a very strong one. I chose boxplots for the bivariate data, which helped me see the distributions. I then picked the variables with the highest correlation values with quality and plotted them together with quality. Here, we see some relationship emerging between alcohol, volatile acidity and quality.
Earlier, I tried plotting variables against quality using a scatter plot without converting quality into a factor. That didn’t produce any insights or visible distributions. After converting quality into a factor variable and using boxplots, certain distributions became quite clear.
To conclude, there are a lot of ways this analysis could be improved. Data about more wines, especially low and high quality ones, would help. Perhaps more chemical attributes need to be recorded. Applying some kind of machine learning algorithm would also help.